{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Julia for Data Science\n",
    "\n",
    "* Data\n",
    "* **Data processing**\n",
    "* Visualization\n",
    "\n",
    "### Data processing: Standard machine learning algorithms in Julia\n",
    "In what's next, we will see how to use some of the standard machine learning algorithms implemented in Julia."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using DataFrames, Statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 1: Kmeans Clustering\n",
    "\n",
    "Let's start with some data.\n",
    "\n",
    "The Sacramento real estate transactions file that we download next is a list of 985 real estate transactions in the Sacramento area reported over a five-day period,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "download(\"http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv\",\"houses.csv\")\n",
    "houses = readtable(\"houses.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use [`Plots.jl`](https://github.com/JuliaPlots/Plots.jl) to plot with the `pyplot` backend. (NOTE: this can take a long time the first time you run it, when it's initializing the package.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Plots\n",
    "pyplot()\n",
    "plot(size=(500,500),leg=false)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's create a scatter plot to show the price of a house vs. its square footage,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = houses[!, :sq__ft]\n",
    "# x = houses[7] # equivalent, useful if file has no header\n",
    "y = houses[!, :price]\n",
    "# y = houses[10] # equivalent\n",
    "scatter(x,y,markersize=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Houses with 0 square feet that cost money?*\n",
    "\n",
    "The square footage seems to not have been recorded in these cases. \n",
    "\n",
    "Filtering these houses out is easy to do!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filter_houses = houses[houses[!, :sq__ft] .> 0, :]  # dot broadcasting\n",
    "x = filter_houses[!, :sq__ft]\n",
    "y = filter_houses[!, :price]\n",
    "scatter(x,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This makes sense! The higher the square footage, the higher the price."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can filter a `DataFrame` by feature value too, using the `by` function.\n",
    "The `mean()` function comes from the `Statistics` module in the standard library, which we get from running `using Statistics` at the top of this file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "by(filter_houses,:type,size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "by(filter_houses,:type,filter_houses->mean(filter_houses[!, :price]))"
   ]
  },
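  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the grouping step more concrete, here is a minimal sketch of the same split-apply-combine idea in plain Julia. The `types` and `prices` vectors are made-up toy data standing in for the `:type` and `:price` columns, and the loop is only an illustration of the idea, not how `DataFrames` implements `by`.\n",
    "\n",
    "```julia\n",
    "using Statistics\n",
    "\n",
    "# Hypothetical toy data standing in for the :type and :price columns\n",
    "types  = [\"Residential\", \"Condo\", \"Residential\", \"Condo\", \"Multi-Family\"]\n",
    "prices = [200_000, 120_000, 300_000, 100_000, 250_000]\n",
    "\n",
    "# Split: collect the prices that belong to each type\n",
    "groups = Dict{String,Vector{Int}}()\n",
    "for (t, p) in zip(types, prices)\n",
    "    push!(get!(groups, t, Int[]), p)\n",
    "end\n",
    "\n",
    "# Apply and combine: one mean per group\n",
    "mean_price = Dict(t => mean(v) for (t, v) in groups)\n",
    "```\n",
    "\n",
    "Here `mean_price[\"Condo\"]` is the mean of the two condo prices in the toy data."
   ]
  },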
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's do some kmeans clustering on this data.\n",
    "\n",
    "First, we can load the `Clustering` package to do this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Pkg.add(\"Clustering\")\n",
    "using Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us see how `Clustering` works with a generic example first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make a random dataset with 1000 points\n",
    "# each point is a 5-dimensional vector\n",
    "J = rand(5, 1000)\n",
    "R = kmeans(J, 20; maxiter=200, display=:iter) \n",
    "# performs K-means over X, trying to group them into 20 clusters\n",
    "# set maximum number of iterations to 200\n",
    "# set display to :iter, so it shows progressive info at each iteration"
   ]
  },
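  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The object returned by `kmeans` can be inspected through its fields. The sketch below assumes the `centers`, `assignments`, `counts`, and `converged` fields of the result type in `Clustering`; field names may differ between package versions, so treat it as illustrative.\n",
    "\n",
    "```julia\n",
    "using Clustering\n",
    "\n",
    "J = rand(5, 1000)               # 1000 points, each a 5-dimensional column\n",
    "R = kmeans(J, 20; maxiter=200)\n",
    "\n",
    "size(R.centers)                 # (5, 20): one column per cluster center\n",
    "length(R.assignments)           # 1000: a cluster index for each point\n",
    "sum(R.counts)                   # 1000: cluster sizes add up to the number of points\n",
    "R.converged                     # whether the algorithm converged within maxiter\n",
    "```"
   ]
  },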
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's get back to the problem in hand and see how this can be applied over there.\n",
    "\n",
    "Let's store the features `:latitude` and `:longitude` in an array `X` that we will pass to `kmeans`.\n",
    "\n",
    "First we add data for `:latitude` and `:longitude` to a new `DataFrame` called `X`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = filter_houses[!, [:latitude,:longitude]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and then we convert `X` to an `Array` via\n",
    "\n",
    "```julia\n",
    "X = Array(X)\n",
    "```\n",
    "or\n",
    "```julia\n",
    "X = convert(Array, X)\n",
    "```\n",
    "\n",
    "Since we know this array has no missing values, we can also change the output type of the array to just Float64s, which we'll need for Clustering below:\n",
    "\n",
    "```julia\n",
    "X = Array{Float64}(X)\n",
    "```\n",
    "or\n",
    "```julia\n",
    "X = convert(Array{Float64}, X)\n",
    "```\n",
    "to turn `X` into an `Array` that stores `Float64`s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = Array{Float64}(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now take the transpose of `X` using the `transpose()` function. A transpose is required\n",
    "since `kmeans()` function takes each row as a `feature`, and each column a `data point`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = transpose(X)\n",
    "#X = X'  # (conjugate transposition) also does the same thing (but only for real-valued arrays).\n",
    "X"
   ]
  },
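  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The row/column convention is easy to mix up, so here is a small sketch with made-up numbers (the matrix `A` is hypothetical toy data, not taken from the houses file). It shows the layout before and after transposing.\n",
    "\n",
    "```julia\n",
    "# Hypothetical toy data: 3 houses, 2 features (latitude, longitude)\n",
    "A = [38.6 -121.4;\n",
    "     38.5 -121.5;\n",
    "     38.7 -121.3]\n",
    "\n",
    "size(A)               # (3, 2): one row per house\n",
    "At = permutedims(A)   # dense 2×3 matrix: one column per house\n",
    "size(At)              # (2, 3)\n",
    "At[:, 1]              # [38.6, -121.4], the first house\n",
    "transpose(A) == At    # true: same values, but `transpose` returns a lazy wrapper\n",
    "```\n",
    "\n",
    "If a function insists on a plain `Matrix`, `permutedims` (or `Matrix(transpose(A))`) gives a dense copy with the same values."
   ]
  },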
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a first pass at guessing how many clusters we might need, let's use the number of zip codes in our data.\n",
    "\n",
    "(Try changing this to see how it impacts results!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "k = length(unique(filter_houses[!, :zip]))\n",
    "# there should be atleast 2 distinct features (k>=2) to group the data points\n",
    "println(\"unique zip codes are \",k)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `kmeans` function to do kmeans clustering!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Clustering\n",
    "C = kmeans(X, k) # try changing k"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's create a new data frame, `df`, with all the same data as `filter_houses` that also includes a column for the cluster to which each house has been assigned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = DataFrame(cluster=C.assignments, city=filter_houses[!, :city],\n",
    "    latitude=filter_houses[!, :latitude], longitude=filter_houses[!, :longitude], zip=filter_houses[!, :zip])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot each cluster as a different color."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clusters_figure = plot()\n",
    "for i = 1:k\n",
    "    clustered_houses = df[df[!, :cluster].== i,:]\n",
    "    xvals = clustered_houses[!, :latitude]\n",
    "    yvals = clustered_houses[!, :longitude]\n",
    "    scatter!(clusters_figure,xvals,yvals,markersize=4)\n",
    "end\n",
    "xlabel!(\"Latitude\")\n",
    "ylabel!(\"Longitude\")\n",
    "title!(\"Houses color-coded by cluster\")\n",
    "display(clusters_figure)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now let's try coloring them by zip code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unique_zips = unique(filter_houses[!, :zip])\n",
    "zips_figure = plot()\n",
    "for uzip in unique_zips\n",
    "    subs = filter_houses[filter_houses[!, :zip].==uzip,:]\n",
    "    x = subs[!, :latitude]\n",
    "    y = subs[!, :longitude]\n",
    "    scatter!(zips_figure,x,y)\n",
    "end\n",
    "xlabel!(\"Latitude\")\n",
    "ylabel!(\"Longitude\")\n",
    "title!(\"Houses color-coded by zip code\")\n",
    "display(zips_figure)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see the two plots side by side."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot(clusters_figure,zips_figure,layout=(2, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not exactly! but almost... Now we know that ZIP codes are not randomly assigned!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 2: Nearest Neighbor with a KDTree\n",
    "\n",
    "For this example, let's start by loading the `NearestNeighbors` package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using NearestNeighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this package, we'll look for the `knearest` neighbors of one of the houses, `point`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "knearest = 10\n",
    "id = 70 # try changing this\n",
    "point = X[:,id]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can build a `KDTree` and use `knn` to look for `point`'s nearest neighbors!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kdtree = KDTree(X)\n",
    "idxs, dists = knn(kdtree, point, knearest, true)"
   ]
  },
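  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what a nearest-neighbor query returns, here is a brute-force sketch in plain Julia on made-up points (the matrix `P` is hypothetical toy data). A `KDTree` is designed to answer the same question without scanning every point; this loop is only meant to show the expected result.\n",
    "\n",
    "```julia\n",
    "using LinearAlgebra\n",
    "\n",
    "# Hypothetical toy data: 5 points in 2D, one point per column\n",
    "P = [0.0 1.0 2.0 5.0 0.5;\n",
    "     0.0 0.0 0.0 5.0 0.5]\n",
    "query = P[:, 1]\n",
    "\n",
    "# Euclidean distance from the query to every column of P\n",
    "d = [norm(P[:, j] - query) for j in 1:size(P, 2)]\n",
    "\n",
    "# Indices of the 3 closest points, nearest first\n",
    "nearest = sortperm(d)[1:3]   # [1, 5, 2]\n",
    "```\n",
    "\n",
    "Because the query is itself one of the points, it shows up as its own nearest neighbor at distance zero. The same happens above when `point` is a column of `X`."
   ]
  },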
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll first generate a plot with all of the houses in the same color,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = filter_houses[!, :latitude];\n",
    "y = filter_houses[!, :longitude];\n",
    "scatter(x,y);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and then overlay the data corresponding to the nearest neighbors of `point` in a different color."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = filter_houses[idxs,:latitude];\n",
    "y = filter_houses[idxs,:longitude];\n",
    "scatter!(x,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are those nearest neighbors in red!\n",
    "\n",
    "We can see the cities of the neighboring houses by using the indices, `idxs`, and the feature, `:city`, to index into the `DataFrame` `filter_houses`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cities = filter_houses[idxs,:city]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 3: PCA for dimensionality reduction\n",
    "\n",
    "Let us try to reduce the dimensions of the price/area data from the houses dataset.\n",
    "\n",
    "We can start by grabbing the square footage and prices of the houses and storing them in an `Array`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "F = filter_houses[!, [:sq__ft,:price]]\n",
    "F = convert(Array{Float64,2},F)'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall how the data looks when we plot housing prices against square footage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scatter(F[1,:],F[2,:])\n",
    "xlabel!(\"Square footage\")\n",
    "ylabel!(\"Housing prices\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use the `MultivariateStats` package to run PCA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pkg.add(\"MultivariateStats\")\n",
    "using MultivariateStats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `fit` to fit the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "M = fit(PCA, F)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that you can choose the maximum dimension of the new space by setting `maxoutdim`, and you can change the method to, for example, `:svd` with the following syntax.\n",
    "\n",
    "```julia\n",
    "fit(PCA, F; maxoutdim = 1,method=:svd)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems like we only get one dimension with PCA! Let's use `transform` to map all of our 2D data in `F` to `1D` data with our model, `M`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = transform(M, F)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use `reconstruct` to put our now 1D data, `y`, in a form that we can easily overlay (`Xr`) with our 2D data in `F` along the principle direction/component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Xr = reconstruct(M, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we create that overlay, where we can see points along the principle component in red. \n",
    "\n",
    "(Each blue point maps uniquely to some red point!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scatter(F[1,:],F[2,:])\n",
    "scatter!(Xr[1,:],Xr[2,:])"
   ]
  },
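  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what fitting, transforming, and reconstructing mean conceptually, the steps can be written out with an SVD from the standard library. The matrix `D` is made-up toy data, and `MultivariateStats` may well use a different algorithm internally; this is only an illustration of projecting onto the first principal direction.\n",
    "\n",
    "```julia\n",
    "using LinearAlgebra, Statistics\n",
    "\n",
    "# Hypothetical toy data: 2 features (rows) by 6 observations (columns)\n",
    "D = [1.0 2.0 3.0 4.0 5.0 6.0;\n",
    "     2.1 3.9 6.2 8.1 9.8 12.0]\n",
    "\n",
    "mu = mean(D, dims=2)      # per-feature mean, a 2×1 matrix\n",
    "Z  = D .- mu              # center the data\n",
    "p1 = svd(Z).U[:, 1]       # first principal direction (a unit vector)\n",
    "\n",
    "scores = p1' * Z          # 1×6: coordinates of each point along p1\n",
    "Dr = p1 * scores .+ mu    # 2×6: the 1D data mapped back into 2D\n",
    "```\n",
    "\n",
    "All columns of `Dr` lie on a single line through the mean, which is why the reconstructed points in the overlay above fall along one direction."
   ]
  },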
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 4: Learn how to build a simple multi-layer-perceptron on the MNIST dataset\n",
    "\n",
    "MNIST from: https://github.com/FluxML/model-zoo/blob/master/mnist/mlp.jl\n",
    "\n",
    "Let's start by loading `Flux`, importing a few things from `Flux` explicitly, and bringing the `repeated` function into our scope."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Flux, Flux.Data.MNIST\n",
    "using Flux: onehotbatch, argmax, crossentropy, throttle\n",
    "using Base.Iterators: repeated"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now store all the MNIST images in `imgs` and take a peak into this vector to see what the data looks like"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "imgs = MNIST.images()\n",
    "imgs[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the type of an individual image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "typeof(imgs[3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Reorganizing our array of images\n",
    "\n",
    "We see this is a 2D array that stores `ColorTypes`. To work more easily with this data, let's convert all `ColorTypes` to floating point numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fpt_imgs = float.(imgs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can see what `imgs[3]` looks like as an array of floats, rather than as an array of colors!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fpt_imgs[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Let's stack the images to create one large 2D array, `X`, that stores the data for each image as a column.**\n",
    "\n",
    "To do this, we can **first** use `reshape` to unravel each image, creating a 1D array (`Vector`) of floats from a 2D array (`Matrix`) of floats."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unraveled_fpt_imgs = reshape.(fpt_imgs, :);\n",
    "typeof(unraveled_fpt_imgs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Note that `Vector` is an alias for a 1D `Array`.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This makes `unraveled_fpt_imgs` a `Vector` of `Vector`s where `imgs[3]` is now"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unraveled_fpt_imgs[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After using `reshape` to get a `Vector` of `Vector`s, we can use `hcat` to build a `Matrix`, `X`, from `unraveled_fpt_imgs` where the `Vector`s stored in `unraveled_fpt_imgs` will become the columns of `X`.\n",
    "\n",
    "Note that we're using the \"splat\" command below, `...`, which allows you to pass all the elements of an object to a function, rather than just passing the object itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = hcat(unraveled_fpt_imgs...)"
   ]
  },
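  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a small sketch of what splatting into `hcat` does, using made-up vectors (`vecs` is hypothetical toy data). It also shows `reduce(hcat, ...)`, which builds the same matrix without expanding a long argument list.\n",
    "\n",
    "```julia\n",
    "# Hypothetical toy data: three \"images\", each already unraveled into a vector\n",
    "vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]\n",
    "\n",
    "A = hcat(vecs...)        # splat: same as hcat(vecs[1], vecs[2], vecs[3])\n",
    "B = reduce(hcat, vecs)   # same result without splatting\n",
    "\n",
    "A == B                   # true\n",
    "size(A)                  # (2, 3): each vector became a column\n",
    "```"
   ]
  },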
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### How to go back to images from this 2D `Array`\n",
    "\n",
    "So now each column in X is an image reshaped to a vector of floating points. Let's pick one column and see what the digit is.\n",
    "\n",
    "Let's try to view the second image in the original array, `imgs`, by taking the second column of `X`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "onefigure = X[:,2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll `reshape` this array to a 2D, 28x28 array,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t1 = reshape(onefigure,28,28)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and finally use `colorview` from the `Images` package to view the handwritten digit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colorview(Gray, t1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Our data is in working order!*\n",
    "\n",
    "For our machine to learn the digit with which each image is associated, we'll need to train it using correct answers. Therefore we'll make use of the `labels` associated with these images from MNIST."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = MNIST.labels() # the true labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One-hot-encode the labels with `onehotbatch`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y = onehotbatch(labels, 0:9)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "which gives a binary indicator vector for each figure"
   ]
  },
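  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the encoding, here is a plain-Julia sketch that builds the same kind of indicator matrix by hand. The `toy_labels` vector is made up, and the comprehension is only an illustration, not how `onehotbatch` is implemented.\n",
    "\n",
    "```julia\n",
    "# Hypothetical toy labels; the real `labels` vector comes from MNIST\n",
    "toy_labels = [5, 0, 4]\n",
    "classes = 0:9\n",
    "\n",
    "# Entry (i, j) is true when label j equals the i-th class\n",
    "onehot = [c == l for c in classes, l in toy_labels]\n",
    "\n",
    "size(onehot)             # (10, 3): one column per label\n",
    "findall(onehot[:, 1])    # [6]: label 5 is the 6th class in 0:9\n",
    "```\n",
    "\n",
    "Each column contains exactly one `true`, at the position of its label within `0:9`."
   ]
  },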
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build the network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "m = Chain(\n",
    "  Dense(28^2, 32, relu),\n",
    "  Dense(32, 10),\n",
    "  softmax)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define the loss functions and accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loss(x, y) = Flux.crossentropy(m(x), y)\n",
    "accuracy(x, y) = mean(argmax(m(x)) .== argmax(y))"
   ]
  },
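  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition about the last layer and the loss, here is a plain-Julia sketch of softmax followed by cross-entropy for a single example. The scores in `z`, the `target` vector, and the helper name `softmax_sketch` are made up for illustration; `Flux` provides its own `softmax` and `crossentropy`.\n",
    "\n",
    "```julia\n",
    "# Hypothetical raw scores for one image over 3 classes\n",
    "z = [2.0, 1.0, 0.1]\n",
    "\n",
    "# Subtracting the maximum avoids overflow and does not change the result\n",
    "softmax_sketch(v) = (e = exp.(v .- maximum(v)); e ./ sum(e))\n",
    "p = softmax_sketch(z)           # probabilities that sum to 1\n",
    "\n",
    "target = [1.0, 0.0, 0.0]        # one-hot label: the true class is the first one\n",
    "ce = -sum(target .* log.(p))    # cross-entropy: small when p[1] is close to 1\n",
    "```\n",
    "\n",
    "Only the probability assigned to the true class contributes to `ce`, so training pushes that probability toward 1."
   ]
  },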
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "methodswith(typeof(ps))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `X` to create our training data and then declare our evaluation function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "datasetx = repeated((X, Y), 200)\n",
    "evalcb = () -> @show(loss(X, Y))\n",
    "ps = Flux.params(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we have defined our training data and our evaluation functions.\n",
    "\n",
    "Let's take a look at the function signature of Flux.train!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "?Flux.train!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Now we can train our model and look at the accuracy thereafter.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "opt = ADAM()\n",
    "Flux.train!(loss, ps, datasetx, opt, cb = throttle(evalcb, 10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy(X, Y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've trained our model, let's create test data, `tX`, "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tX = hcat(float.(reshape.(MNIST.images(:test), :))...)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and run our model on one of the images from `tX`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_image = m(tX[:,1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "argmax(test_image) - 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The largest element of `test_image` is the 8th element, so our model says that test_image is a \"7\".\n",
    "\n",
    "Now we can look at the original image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using Images\n",
    "t1 = reshape(tX[:,1],28,28)\n",
    "colorview(Gray, t1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and there we have it!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example 5: Linear regression in Julia (we will write our own Julia code and Python code)\n",
    "\n",
    "Let's try to find the best line fit of the following data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xvals = repeat(1:0.5:10, inner=2)\n",
    "yvals = 3 .+ xvals .+ 2 .* rand(length(xvals)) .-1\n",
    "scatter(xvals, yvals, color=:black, leg=false)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to fit a line through this data.\n",
    "\n",
    "Let's write a Julia function to do this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "function find_best_fit(xvals,yvals)\n",
    "    meanx = mean(xvals)\n",
    "    meany = mean(yvals)\n",
    "    stdx = std(xvals)\n",
    "    stdy = std(yvals)\n",
    "    r = cor(xvals,yvals)\n",
    "    a = r*stdy/stdx\n",
    "    b = meany - a*meanx\n",
    "    return a,b\n",
    "end"
   ]
  },
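  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check on the formulas used in `find_best_fit`, the sketch below applies them to made-up points that lie exactly on a known line and compares the result with a generic least-squares solve. The `xs` and `ys` vectors are hypothetical toy data.\n",
    "\n",
    "```julia\n",
    "using LinearAlgebra, Statistics\n",
    "\n",
    "# Hypothetical toy data lying exactly on the line y = 2x + 1\n",
    "xs = [1.0, 2.0, 3.0, 4.0]\n",
    "ys = 2 .* xs .+ 1\n",
    "\n",
    "# Closed-form simple linear regression, as in `find_best_fit`\n",
    "slope = cor(xs, ys) * std(ys) / std(xs)\n",
    "intercept = mean(ys) - slope * mean(xs)\n",
    "\n",
    "# Cross-check: least-squares solution of [x 1] * [a, b] ≈ y\n",
    "coeffs = [xs ones(length(xs))] \\ ys\n",
    "```\n",
    "\n",
    "Both approaches should recover a slope close to 2 and an intercept close to 1, up to floating-point rounding."
   ]
  },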
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To fit the line, we just need to find the slope and the y-intercept (a and b).\n",
    "\n",
    "Then add this fit to the existing plot!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a,b = find_best_fit(xvals,yvals)\n",
    "ynew = a .* xvals .+ b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot!(xvals,ynew)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's generate a much bigger dataset,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xvals = 1:100000;\n",
    "xvals = repeat(xvals,inner=3);\n",
    "yvals = 3 .+ xvals .+ 2 .* rand(length(xvals)) .- 1;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@show size(xvals)\n",
    "@show size(yvals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and now we can time how long it takes to find a fit to this data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@time a,b = find_best_fit(xvals,yvals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will write the same code using Python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using PyCall\n",
    "using Conda"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "py\"\"\"\n",
    "import numpy\n",
    "def find_best_fit_python(xvals,yvals):\n",
    "    meanx = numpy.mean(xvals)\n",
    "    meany = numpy.mean(yvals)\n",
    "    stdx = numpy.std(xvals)\n",
    "    stdy = numpy.std(yvals)\n",
    "    r = numpy.corrcoef(xvals,yvals)[0][1]\n",
    "    a = r*stdy/stdx\n",
    "    b = meany - a*meanx\n",
    "    return a,b\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "find_best_fit_python = py\"find_best_fit_python\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xpy = PyObject(xvals)\n",
    "ypy = PyObject(yvals)\n",
    "@time a,b = find_best_fit_python(xpy,ypy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Let's use the benchmarking package to time these two.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "using BenchmarkTools"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@btime a,b = find_best_fit_python(xvals,yvals)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@btime a,b = find_best_fit(xvals,yvals)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "kernelspec": {
   "display_name": "Julia 1.0.5",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
